where $p_k(\tau)$ denotes the proportion of the $k$-th class of data points under the partitioning rule $(x = \tau)$. The Gini index is defined as below,

$$I(x = \tau) = \sum_{k=1}^{K} p_k(\tau)\left[1 - p_k(\tau)\right] \qquad (3.75)$$

It can be seen that, if a subspace is pure for one class, then $p_k(\tau) = 1$ and $1 - p_k(\tau) = 0$, or $p_k(\tau) = 0$ and $1 - p_k(\tau) = 1$; in either case, $p_k(\tau)\left[1 - p_k(\tau)\right] = 0$. If all subspaces are pure for one class, $I(x = \tau) = 0$. If a partitioning rule generates a random classification model, $p_k(\tau) = 0.5$ and the value of $I(x = \tau)$ is $0.25 \times K$. Therefore, the minimum value of the Gini index is zero, attained when a set of partitioning rules constitutes a perfect classification model, and the maximum value of the Gini index is $0.25K$.
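The Gini computation of Eq. (3.75) can be sketched in a few lines; the function and variable names below are illustrative, not from the text.

```python
def gini_index(p):
    """Gini index of Eq. (3.75): sum of p_k(tau) * (1 - p_k(tau)) over K classes.

    `p` is the list of per-class proportions p_k(tau) for one partitioning rule.
    """
    return sum(p_k * (1.0 - p_k) for p_k in p)


# A pure subspace (one class holds all the points) gives zero impurity.
print(gini_index([1.0, 0.0]))   # 0.0

# A random two-class split with p_k = 0.5 gives 0.25 * K = 0.5.
print(gini_index([0.5, 0.5]))   # 0.5
```

The two calls reproduce the extremes discussed above: zero for a perfect split and $0.25K$ for a random one.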

Another measurement is called the information gain, which is based on information theory and is defined as below, where $\log p_k(\tau)$ has been replaced by $\log\left(1 + p_k(\tau)\right)$ to avoid an infinite value as $p_k(\tau)$ approaches zero,

$$I(x = \tau) = \sum_{k=1}^{K} p_k(\tau)\log\left(1 + p_k(\tau)\right) \qquad (3.76)$$

The information gain should be maximised to obtain a better partitioning rule. If a partitioning rule generates a pure subspace, $p_k(\tau) = 1$ and therefore $p_k(\tau)\log\left(1 + p_k(\tau)\right) = 1$ (taking the base-2 logarithm). The maximum information gain is thus $K$, attained when all subspaces are pure.
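The modified gain of Eq. (3.76) can be sketched as follows, assuming the base-2 logarithm; the names are illustrative, not from the text.

```python
import math


def information_gain(p):
    """Modified information gain of Eq. (3.76).

    Sums p_k(tau) * log2(1 + p_k(tau)) over the per-class proportions in `p`.
    Using log(1 + p) instead of log(p) keeps the value finite as p_k -> 0.
    """
    return sum(p_k * math.log2(1.0 + p_k) for p_k in p)


# A pure subspace contributes 1 * log2(2) = 1 to the gain.
print(information_gain([1.0]))   # 1.0

# As p_k -> 0 the term vanishes instead of diverging to -infinity.
print(information_gain([0.0]))   # 0.0
```

The second call illustrates why the $\log(1 + p_k(\tau))$ substitution is made: $\log 0$ would be infinite, while $\log(1 + 0) = 0$.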

[Figure] The impurity measurement for partitioning rules applied to the data shown in (a). (a) The Gini index. (b) The entropy index (information gain).